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We consider the testing and estimation of change-points - loca- 
tions where the distribution abruptly changes - in a data sequence. A 
new approach, based on scan statistics utilizing graphs representing 
the similarity between observations, is proposed. The graph-based 
approach can be applied to data in high dimension, and even to non- 
Euclidean data, as long as an informative similarity measure on the 
sample space can be defined. Accurate analytic approximations to 
the significance of these graph-based scan statistics for both the sin- 
gle change-point and the changed interval alternatives are provided. 
Simulations reveal that the new approach has better power than ex- 
isting approaches when the dimension of the data is high. The new 
approach is illustrated on two applications: The detection of change 
of a network over time, and the determination of authorship of a 
classic novel. 

1. Introduction. Change-point models are widely used in various fields 
for detecting lack of homogeneity in a sequence of observations. In the typical 
formulation, the observations {yi : i = l,2,...,n} are assumed to have 
distribution Fq for i < t and possibly a different distribution Fi for i > t. 
The parameter r is referred to as the change-point. We consider the case 
where the total length of the sequence n is fixed. There is a rich literature 
on theory and applications of this model when yi are real or integer valued 
scalars. For example, in a well known study of the annual flow volume of 
the Nile River at the city of Aswan, Egypt, from 1871 to 1970, each yi is 
a continuous measurement of the annual discharge from the river (Cobb, 
1978), and the goal is to detect shifts in flow volume. If the distribution 
of yi were assumed to be normal, score- or likelihood- based tests can be 
applied (James, James and Siegmund, 1987). Bayesian and non-parametric 
approaches have also been developed (see Carlstein, Miiller and Siegmund 
(1994) for a survey). 

Modern statistical applications are faced with data of increasing richness 
and dimension. High throughput measurement schemes and digitization in 
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many scientific fields liave produced data sequences {yj : i = 1,2, . . . ,n}, 
where each yj is a high dimensional vector or even a non-Euclidean data 
object. The dimension of each observation can be larger than the length of 
the sequence. Testing the homogeneity of such high dimensional sequences 
is a challenging but important problem. Following are some motivating ex- 
amples: 

Network evolution: Data on networks have become increasingly common. 
For example, e-mail, phone, and on-line chat records can be used to 
construct a network of social interactions among individuals (Kossinets 
and Watts, 2006; Eagle, Pentland and Lazer, 2009). High throughput 
biological experiments have led to the ubiquitous study of protein- or 
gene- interaction networks. A large part of these studies is character- 
izing how the network evolves through time. Here, the observation at 
each time point is a graphical encoding of the network. In a longitudi- 
nal study, one might ask whether there is an abrupt shift in network 
connectivity at any point in time. 

Image analysis: Image data collected through time appears in diverse ap- 
plications, from video surveillance to climatology to neuroscience. The 
detection of abrupt events, such as security breaches, storms, or brain 
activity, can be formulated as a change-point problem. Here, the ob- 
servation at each time point is the digital encoding of an image. 

Text or sequence analysis: Many classic works in both western and east- 
ern literature have ongoing authorship debates. For example, the de- 
bate surrounding both Tirant lo Blanc, a Catalan romance, and Dream 
of the Red Chamber, a Chinese masterpiece, is whether there is a 
change of authorship mid-way through the novel. In the digital era, an 
objective approach to these debates is to statistically test for abrupt 
changes in writing style, which can be reflected by word usage. Sim- 
ilar problems arise in genomic sequence analysis in biology, where it 
is often of interest to find regions of the genome with different DNA- 
word compositions (see, for example, Tsirigos and Rigoutsos (2005)). 
In both settings, each observation in the sequence is a vector of word 
counts over a large dictionary of words. 

Existing approaches for change-point detection in a sequence of multi- 
variate observations are limited in many ways. Most methods are based 
on parametric models that are highly context specific. For example, Zhang 
et al. (2010) and Siegmund, Yakir and Zhang (2011) studied the problem of 
detecting common shifts in mean in multivariate Gaussian sequences with 
identity covariance. Srivastava and Worsley (1986) and James, James and 
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Siegmund (1992) discussed general likelihood ratio tests for a change in 
multivariate normal mean, which requires that the dimension of the obser- 
vations be smaller than the number of observations. None of the parametric 
approaches for multivariate data can be applied in very high dimensions, un- 
less strong assumptions are made to avoid the estimation of the large number 
of nuisance parameters that are a by-product of increasing dimension. 

In the nonparametric domain, Desobry, Davy and Doncarli (2005) and 
Harchaoui, Bach and Moulines (2009) used kernel-based methods. A com- 
mon drawback for kernel-based methods is that they rely heavily on the 
choice of kernel functions and parameters, and the problem becomes more 
severe when applied to high-dimensional data. Lung-Yut-Fong, Levy-Leduc 
and Cappe (2011) used a non-parametric approach based on marginal rank 
statistics, which also requires the restriction that the number of observations 
be larger than the dimension of the data. Although these tests can be quite 
useful in low dimensions, they are impractical when data resides in high 
dimensional sample space. 

In this paper, we describe a nonparametric approach to change-point de- 
tection. The approach can be applied to data in arbitrary dimensions or 
even non-Euclidean data, with a general, analytic formula for type I error 
control. We illustrate the approach on two real problems: Testing the tem- 
poral homogeneity of a social network, and testing for a change in author 
of a classic European novel. We show, via simulations, that as dimension 
increases this nonparametric method gains power over parametric methods 
in cases where the parametric methods can be applied. The generality of 
the new approach and the availability of analytic formulas for type I error 
computation make it an easy off-the-shelf tool for homogeneity testing in 
multivariate settings. 

This paper is organized as follows: In Section 2 we describe the new 
method. The basic underlying idea is graph-based two-sample tests adapted 
to the scan-statistic setting. Two-sample tests based on various types of 
graphs representing the similarity between observations were first proposed 
in Friedman and Rafsky (1979) and Rosenbaum (2005). We review these pre- 
vious works in Section 2.1. Once the graph has been constructed, theoretical 
analysis of the scan statistic can be decoupled from the modeling of the high 
dimensional data. We describe the detection of a single change-point in Sec- 
tion 2.2, and the detection of a changed interval in Section 2.3. Section 3 
gives analytic formulas for approximating the significance of the tests, and 
evaluates their accuracy in numerical studies. Section 4 evaluates the power 
of the test via simulations. In Section 5, the new method is applied to the 
analysis of the Friendship Network data set collected by the MIT Media 
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Laboratory (Eagle, Pentland and Lazer, 2009), and the analysis of the text 
of Tirant lo Blanc. Finally, we conclude with a discussion in Section 6. 

2. A Graph-based Framework for Change-Point Detection. We 

start with a formal formulation of the problem. Consider a sequence of in- 
dependent observations {y^}, z = 1, . . . , n. Let Fq and Fi be two probability 
measures on the sample space. We do not assume that Fq and Fi are known, 
that is, Fq and Fi can be arbitrary, but need to differ on a set of non-zero 
measure. We are concerned with testing the null hypothesis, 

(2.1) i^o : yi ~ -Pb, i = 1, . . . ,n, 

against the alternative, 



(2.2) Ha:3l<T<n, yi 



Fi, i> T 
Fq, otherwise. 



Under the alternative hypothesis, there exists a time point r where the 
distribution of the observations changes abruptly from Fq to Fi. A related 
alternative, which we will also study, is that of a changed interval: 



(2.3) Ha:3l<n<T2<n, Yi 



Fi, i = Tl + 1,. . . ,T2 

Fq, otherwise. 



Under the second alternative, yj changes from Fq to Fi and then back to 
Fq at some later time. 

In both the single change-point and changed interval alternatives, the 
observations are divided into two distinct groups. For a changed interval, we 
may be only interested in changes that persist for a minimum length, that 
is, we may constrain T2 — ti > Iq for some Iq > 1. We may also impose a 
maximum duration with the constraint T2 — T1 < li. In the one change-point 
scenario, we may require that 1 < tiq < t < ni < n. Sometimes, the domain 
knowledge gives us good choices for these values. 

We do not impose any restrictions on the sample space or distribution of 
Yi. Our approach requires that the similarity between can be represented 
by a graph, with edges in the graph connecting observations that are "close" 
in some sense. For the proposed method to have good power, data points 
drawn from Fq need to be closer to each other than to data points drawn 
from Fi, in a global sense, and vice versa. We describe this in more detail 
next, and briefly review graph-based two-sample tests. 
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2.1. Graph-Based Two-Sample Tests. By graph-based tests, we refer to 
tests that are based on graphs with the observations {y?,} as nodes. The 
graph is usually derived from a distance or a generalized dissimilarity on the 
sample space, with edges connecting observations that are close in distance. 
For example, Friedman and Rafsky (1979) proposed the first graph-based 
test for testing the null hypothesis that subjects from two groups are equal 
in distribution against an omnibus alternative. Their method relies on the 
minimum spanning tree (MST), which is a tree connecting all observations 
minimizing the total distance across edges. Their test statistic is the number 
of edges in the tree connecting observations from different groups, rejecting 
the null hypothesis when this count is low compared to its distribution under 
permutation. The rationale is that, if the two groups come from different 
distributions, data points from the same group should be closer to each 
other, and thus edges in the tree should be more likely to connect subjects 
within a group. 

There are many other ways to construct the graph. Rosenbaum (2005) 
proposed minimum distance pairing (MDP), which divides the n subjects 
into n/2 (assuming n is even) non-overlapping pairs in such a way as to 
minimize the total of n/2 distances between pairs. For odd N, Rosenbaum 
suggested creating a pseudo data point that has distance with all other 
subjects, and later discarding the pair containing this pseudo point. This 
method has the desirable property of being truly distribution free, an im- 
portant fact that we will explore in later sections. 

The nearest neighbor graph (NNG), which connects each data point to 
its nearest neighbor, can also be used to define a statistic in similar style to 
Friedman and Rafsky (1979) and Rosenbaum (2005). 

Figure 1 ihustrates the MST, MDP, and NNG on 40 points in Ways 
to construct the graph are not limited to these three. In some applications, 
the graph may be given at the start of the analysis without alluding to 
an underlying distance measure, see the Haplotype example in Chen and 
Zhang (2012). The proposed method does not depend on how the graph was 
constructed. The test statistic and its properties under the permutation null 
rely only on the graph and not on the underlying distance measure nor on 
the original data. However, the quality of the graph in separating Fq and Fi 
is integral to the power of the test. 

2.2. Test Statistic for a Single Change-Point. Here, we consider testing 
the null (2.1) versus the single change-point alternative (2.2). Each possible 
value of r divides the observations into two groups: Those that come before 
and after r. Let G be the graph on yj, as described in Section 2.1. Let Sg be 
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MST MDP NNG 




Fig 1. The MST, MDP and NNG graphs on an example two-dimensional data set. 20 
points were drawn from A/'(0,/2) (shown in triangles) and 20 points were drawn from 
M[{2,2)' , I2) (shown in circles). 



the set of edges in G. For any event x let be the indicator function that 
takes value 1 if 2; is true, and otherwise. Then, at any time t, the number 
of edges connecting points from different groups is 

R{'t)= ^ ^g^it)f^gj{t), 9i{t)=h>t- 

As in the Friedman-Rafsky and Rosenbaum's tests, small values of R{t) are 
evidence against the null. To standardized R{t) so that it is comparable 
across t, let 

(2.4) = ^W-'^P">' 

\/Var|ii(t)] 

In the standardization, we also invert the sign, so that large values of Z{t) are 
evidence against the null. The expectation and variance above are defined 
under the permutation null, which places 1/n! probability on each of the 
n! permutations of {yj : i = l,...,n}. Lemma 2.1 below gives analytic 
formulas for E[i?(t)] and Var[i?(t)]. Figure 2 illustrates the computation of 
R{t) on a small artificial data set of length n = 40 with a true change-point 
at r = 20, and shows plots of the R(t) and Z{t) processes. 
The scan statistic for testing Hq versus Ha is 



(2.5) 



max Z{t), 

no<.t<.ni 
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Fig 2. T/ie nine p/ots m the first three rows illustrate the computation of R{t) for nine 
different values of t. The data is a sequence of length n — 40, with the first 20 points 
drawn from Af(0, 12) and the second 20 points drawn from N{(2, 2)', 72). The MST based 
on Euclidean distance is used as G, whose edges are also shown in the plots. Each t 
divides the observations into two groups, one group for observations before and at t (shown 
as triangles) and the other group for observations after t (shown as circles). Edges that 
connect observations from the two different groups (i.e. edges connecting a triangle and a 
circle) are bold in the graph. R{t) is the count of the number of bold edges. G does not 
change as t changes, but the group identities of some observations change, causing R{t) 
to change. The profile of R{t) and Z{t) for t = 1, . . . ,n are plotted in the fourth row. 
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where no and ni are pre-specified constraints for the range of r. 

Lemma 2.1. Under the permutation null, the expectation and variance 
of R{t) are 



B[R{t)]=pi{t)\£G\, 

^1 



Var[R{t)] = p2it)\£G\ + 



^Pi{t)-P2{t) 



Y^\£,\^ + [p,it) - plit)] \£g\', 



where £i = {j : {i,j) G £g} is the set of edges containing node i, and 
, , 2t(n - t) 

P2{t) 



n{n — 1) ' 

4t{t - l)(n - t){n-t- 1) 
n(n — l)(n — 2)(n — 3) 



The proof of this lemma is in Appendix A.l. 

Remark 2.1. The expectation and variance of R{t) under the permu- 
tation null depend only on t, n, and two characteristics of the graph - the 
number of edges (\£g\) o,i^d the sum of squares of node degrees li^iPj- 

2.3. Test Statistic for a Changed Interval. Next consider testing Hq ver- 
sus the changed interval alternative (2.3). Similar to the single change-point 
case, any specific alternative (ti,i2) divides the data into two groups, one 
group containing all points observed during {ti,t2], and the other group con- 
taining points observed outside of this interval. Then, the number of edges 
in G connecting data points from different groups is 

R{h,t2) = ^ hi{tiM)Hj(ti,t2)' gi{h,t2) = Iti<i<t2- 

We standardize R{ti,t2) as before, 

R{ti,t2)-E[R{ti,t2)] 



Z{tl,t2) 



y/Yar[R{ti,t2)] 



Lemma 2.2 gives expressions for E[i?(ti,t2)] and Var[i?(ti, ^2)] under the 
permutation null. The scan statistic involves a maximization over ti and t2, 



(2.6) 



max Z{ti,t2) 

lQ<t2—tl<ll 
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where Iq and li are constraints on the window size. For example, we can set 
li = n — Iq so that only alternatives where the number of observations in 
either group is larger than Iq are considered. 

Lemma 2.2. Under the permutation null, the expectation and variance 
of R{ti,t2) are 



Y,[R{tiM)] = Pi{t2 - ti)\£Gl 

"1 



2^1(^2 - ti) -P2{t2 - h) 



Var[R{ti,t2)] = P2{t2 - ti)\£G\ 

+ [P2{t2 - h) - pl{t2 - tl)] \£g\\ 

where pi{-) and p2{-) are defined in Lemma (2.1). 
This lemma follows easily from Lemma 2.1. 

Remark 2.2. We can also constrain ti and t2 to a prefixed region us- 
ing background information. The procedure will be similar and the p-value 
approximations in Section 3 are up to some minor modifications. 

3. Analytic Approximations to Significance Levels. How large do 
the values of the scan statistics (2.5) and (3.2) need to be to constitute suffi- 
cient evidence against the null hypothesis of homogeneity? We are concerned 
about the tail distribution of the scan statistics under Hq, that is, 



(3.1) P max Z{t) > b 

\no<t<ni 

for the single change-point alternative, and 



(3.2) P max Z(ti,t2)>b 

\lo<t2-ti<h 

for the changed interval alternative. The probabilities in (3.1) and (3.2) 
are defined under the permutation distribution, where the order of is 
permuted. Under the null hypothesis, the observations are iid distributed, 
the scan statistic calculated under any permutation of time would follow the 
same distribution. Treating {Z(t)} and {Z{ti,t2)} as families of dependent 
tests, the probabilities above are simply the family-wise error rate. 

When the number of observations n is not too large, we can sample di- 
rectly from the permutation distribution to calculate a permutation p-value. 
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When n is large, permutation would be computationally prohibitive, espe- 
cially for (3.2) where each scan is of order O(n^) if li — Iq = 0{n). 

In this section, we give accurate analytic approximations to (3.1) and 

(3.2) . The approximations are obtained using the method developed by 
Woodroofe (1976, 1978) for one-dimensional processes and extended by 
Siegmund (1988, 1992) to random fields. We rely on the fact that, un- 
der some assumptions regarding the graph, {Z{nu) : < u < 1} and 
{Z{nui,nu2) : < < n2 < 1} converge to Gaussian processes as n — )• oo, 
which leads to our first approximations in Section 3.1. The Gaussian approx- 
imation is inaccurate when the minimum window size is small or the graph 
dominated by hubs. We derive more accurate analytic approximations by 
correcting for skewness, using an approach similar to Tang and Siegmund 
(2001), in Section 3.2. Results of numerical studies evaluating the accuracy 
of the analytic approximations are given in Section 3.3. 

3.1. Approximations Assuming Gaussianity. As n —>• cx), assuming f/n —t- 
u G (0, 1), Z{t) converges under the permutation null distribution to a stan- 
dard normal under certain conditions on the graph G. Sufficient conditions 
for convergence have been given in Friedman and Rafsky (1979), Rosen- 
baum (2005), and Chen and Zhang (2012). Friedman and Rafsky (1979) 
showed, by casting R{t) in the form of a generalized correlation coefficient, 
that asymptotic normality holds under the assumption that the number of 
edge pairs sharing a common node is 0{n). Rosenbaum (2005) showed that 
convergence occurs for MDP. Chen and Zhang (2012) studied a broader 
class of statistics and showed using Stein's method that convergence occurs 
if maxj \£i\^/n — )• 0. 

A hub in a graph is a node with many more connections than the aver- 
age node. Friedman and Rafsky's condition requires that both the size and 
density of hubs in the graph can not grow too quickly with n. Chen and 
Zhang's condition limits only the size of the biggest hub, but in a much 
more stringent way. As a general rule, the larger and denser the hubs in the 
graph, the slower the convergence to normal. 

3.1.1. A Single Ghange-Point. 

Proposition 3.1. Let uq = aon, ni = aiu, and h = bQU^^'^ for arbitrary 
ao,ai,bo with < oq < ai < 1 and bo > 0. Let Z{t) be a discrete time 
Gaussian process with E[Z(t)] = 0, E[Z^(t)] = 1, and the same covariance 
function as Z{t). Then as n ^ oo, 

(3.3) P ( max Z{t) > b] ^ ^ / ' nb^h{u)iy (y^2b^h{u)) du, 



no<t<ni 



GRAPH-BASED CHANGE-POINT DETECTION 11 

where 

(3.4) v{x) = 23;-^exp |-2 ^ m"^«> ^-^xm^/^^ | , 

and $(•) the density and cumulative distribution of the standard normal 
distribution, and 
(3.5) 

{n-l)[h{n,u)\£G\+f2{n,u) T.l=i\^^ - h{n.u)\8G?] 



h{u 
where 



2nu{l-u)[U{n,u)\£G\+h{n,u) Er=i l^d' " /6(^> ^)I^gP] ' 

/i(n, u) = 4n(n — 1) (— 2nn^ -|- 2nu — n) 

f2{n,t) = n [n(n-M)(l - 2uf - 2(n - 1)] 

/3(n,t) = 4n [n(l-2u)2-l] 

/4(ra, t) = 4n(n — l)(nn — l)(n — nu — 1) 

/5(n, t)=n(n-\) \n^{\ - 2uf -n + 2] 

fQ{n, t) = 4n [n2(l - 2uf - 2n(l - 3n Su"^) + l] . 

The approximation (3.3) can be obtained using the method of Woodroofe 
(1976, 1978), and also by a different approach described in Siegmund, Yakir 
and Zhang (2010). In Appendix A. 2, we show the main argument following 
Woodroofe's method. 

The function h(u) (3.5) is the absolute one-sided derivative 

(3.6) hiu) = p',{0~) =9_p„(0) = -a+p„(0), 

where pu{d) = Cor(Z(n(n - 6)), Z{nu)) = Cor{Z{n{u - 5)), Z{nu)). 

Remark 3.1. Pu(O^) only depends on u, n and the two characteristics 
of the graph — number of edges (\£g\) o^*^ sum of squares of degrees of 
nodes (X]"=x I'^iP/ Thus, the analytic approximation to p-values only de- 
pends on these two characteristics of the graph and the prespecified param- 
eters (ao, ai, bQ,n). 

Remark 3.2. The function v (3.4) is closely related to the Laplace trans- 
form of the overshoot over the boundary of a random walk. A simple ap- 
proximation given in Siegmund and Yakir (2007) is sufficient for numerical 
purpose: 

(3 7) u(x) ~ (2/^)(^(^/2)-0-5) 

^ ■ ^ ^ ^ ^ {x/2)<^>{x/2) + (j){x/2)' 
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3.1.2. A Changed Interval. 



Proposition 3.2. Let Iq 



aon, li 



ain, and b = bon^^'^ for ar- 



bitrary ao,ai,bo with < ao < < 1 o,i^d bo > 0. Let Z{ti,t2) be 
a two-dimensional discrete time Gaussian process, with E[Z(ti,t2)] = 0, 
'E[Z'^{ti,t2)] = 1 and the same covariance function as Z{ti,t2). Then as 
n — )• oo, 



where h{-) and are defined in (3.5) and (3.4), respectively. 

The above proposition can be proved using the method described in Sieg- 
mund (1988, 1992). We see that the approximation (3.8) involves the same 
quantities as the approximation (3.3) for the one dimensional process. Some 
intuitions are given in the sketch of the proof in Appendix A. 3. 

3.2. Skewness Correction. Convergence of Z{t) to normahty is slow if 
t/n is close to or 1. Also, NNG and MST constructed on high dimensional 
data can be dominated by hubs under standard distance measures, such as 
L2 or Li. We see this to be true, for example, under L2 distance when the 
dimension is 100 in the simulations of Section 3.3. In our simulations we have 
noticed that if the underlying graph is void of hubs, then the statistic Z{t) 
is right-skewed, and the approximations (3.3) and (3.8) underestimate the 
true tail probabilities. If a graph is dominated by hubs, then the statistic is 
left-skewed, and the approximations overestimate the tail probabilities. The 
effects of skewness are explored in more detail in Appendix B.3. 

3.2.1. A S ingle Change-Point. Let 7(t) = E[Z^{t)], and let Obit) = (-1 + 
Y^l + 2j{t)b)/'j{t). Then, a better approximation for the significance of the 
single change-point scan statistic is 



(3.8) 




max 





(3.9) 




max 

no<t<ni 




'U. 



where 
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The details of this approximation are shown in Appendix B.l. The ana- 
lytic expression for the skewness term 7(i) is given in Lemma 3.1 next. Note 
that, if 7(i) = 0, then 6'fc(i) = b and S{t) = 1. 

Remark 3.3. The term 6b{t) is an approximation to the solution of 
= b, where ipt{(^) is the cumulant generating function of Z{t). By 
a third order Taylor approximation to iptiG), we have ip^ (b) ~ ( — 1 -|- 
^yY+~2rf(t)b) / j(t) . When the marginal distribution is left-skewed, it is pos- 
sible that 7(t) can be too small for 1 -\- 2"f{t)b to be positive. This does not 
mean that the solution to iptid) = b does not exist, but that higher moments 
are needed to get a good approximation. In this paper, we apply an easy 
heuristic fix to this problem: Since 1 -|- 2j{t)b < usually happens when t/n 
is close to or 1, within this problematic region 9b{t) can be extrapolated 
using its values outside the region. The details of the extrapolation method 
are shown in Appendix B.3. 

Lemma 3.1. 

7(t) = E[Z it)] = ^ Var[R{tW/^ ' 

where E[i?(t)] and Var[R{t)\ are given in Lemma 2.1, and 

E[i?3(t)] = p,{t)\£G\ + \pi{t) \m£i\ - 1) 

i 

+ 3p,{t) [\£GmG\ - 1) + ^ E i^^Ki^^i - - i^'d) 
+p,{t)Y\msi\-im\-2) 

i 

+P4(t) (i^:gI(i^:gI-i)(i^:gI-2)+6 m-m^o\-^) 

-2p^{t) Y \{k:ii,k),{j,k)G£G}\ 
-P4{t) (^\£i\{\£i\-im£G\-2\£i\-2)^ . 
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The functions pi{t) and p2{t) are given in Lemma 2.1, and 



P3{t) 



t{n -t)[{n-t- l)(n - t - 2) + {t - l){t - 2)] 

n(n - l)(n - 2)(n - 3) 
8t{t - l){t - 2)(n - t){n - t - l)(n - t - 2) 
n(n - l)(n - 2)(n - 3)(n - 4)(n - 5) 



The proof of this Lemma is in Appendix B.2. The terms in E[i?'^(t)] can 
be rearranged and written in other forms. The expansion shown here makes 
it easier to understand the origin of each term in the context of the proof. 

Remark 3.4. For MDP, E[i?^(t)] has a much simpler expression: 

B[R^{t)] = pi{t)n + p2{t)3n{n - 1) + p4{t)n{n - l)(n - 2) 

= {pi{t) - 3p2{t) + 2pi{t))n + 3{p2{t) - P4{t))n'^ + Pi{t)n\ 

The derivation is shown in Appendix B.2. 

3.2.2. A Changed Interval. The significance approximation for the inter- 
val scan after skewness correction is 



where S{-) is given in (3.10), ao = lo/n,ai = li/n. The derivation for (3.11) 
almost exactly parallels the single change-point case. 

3.3. Numerical Studies. To check the analytic approximations to p- values, 
we compare the critical values obtained from (3.3), (3.9), (3.8), and (3.11) to 
those obtained from permutation, under various simulation settings. In each 
simulation, iid sequences of length 1000 were generated from a given distri- 
bution Fq in JR'^. MST, MDP, and NNG were constructed on the data based 
on Euclidean distance. For each graph, analytic and permutation critical 
values were computed for both 0.05 and 0.01 p-value thresholds. 

Tables 1-5 show the results for the single change-point alternative. In 
the column headers, "Al" denotes the critical values obtained assuming 
Gaussianity (3.3), "A2" denotes the critical values obtained after correcting 
for skewness (3.9), and "Per" denotes the critical values obtained by 10,000 
permutations. 



(3.11) 
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Six different choices for Fq are shown, for two different distributions (stan- 
dard normal and exponential with mean 1), each in three different dimen- 
sions (d=l,10, or 100). For d = 10 or 100, each element of the data vector is 
generated independently from the given distribution. The analytic approxi- 
mations depend also on constraints on the region in which the change-point 
is searched. These are reflected in the choice of no and ni (Iq and li for the 
changed interval alternative). To make things simple, we set rii = n — hq, 
so that we only allow the case that both groups have at least no observa- 
tions. In general, the analytic approximations become less precise when the 
minimum segment length decreases. This is mainly because the Gaussian ap- 
proximation (and skewness correction) to the distribution of Z{t) degrades 
for small samples. 

Both the analytic and permutation p-values depend on certain charac- 
teristics of the graph's structure. The structures of MST (for d > 2) and 
NNG depend on the underlying data set, and thus the critical values vary by 
simulation run. In such cases, we show results for 5 randomly simulated se- 
quences. Two characteristics of the graph are also shown for each simulated 
sequence: The sum of squared node degrees (^^ and the maximum 

node degree {D). These quantities give some intuition on the size and den- 
sity of hubs in the graph. Since the MST for any one-dimensional data set is a 
chain, the critical values for MST-based scan do not change with simulation 
run for each setting of the parameters. 

The structure of the MDP graph is always the same for all data sets. 
Therefore, the critical values for MDP-based scan depend only on n, no, ni 
(/o and li for the changed interval alternative). The critical values for MDP- 
based scan do not depend on the dimension or the underlying distribution 
of the data. As emphasized in Rosenbaum (2005), it is a truly distribution 
free method, which can be desirable in high dimensions. 

We can see from the tables that the analytic approximations after skew- 
ness correction perform much better than the analytic approximations under 
Gaussian assumption, especially when dimension increases. The accuracy of 
the skew-corrected approximation does not degrade significantly with dimen- 
sion. For the statistics based on MDP, the skew-corrected approximations 
work quite well when the minimum window size is as small as 25 at 0.05 
significance level, and 50 at 0.01 significance level. For the statistics based 
on MST and NNG, the skew corrected approximations remain accurate for 
window sizes as small as 25 at both 0.05 and 0.01 significance levels. 

There is not much difference between results for simulations based on 
normal and those based on exponential distributions. The main factor in- 
fluencing approximation accuracy, other than the minimum window size, is 
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the dimension (d). As dimension increases, the graph becomes more "star- 
shaped" as reflected by the increase in both ^ l^jp and D. As shown in 
Section 3.2, skewness and other higher order moments of Z(t) are a function 
of polynomials of the node degrees. Thus the increase in the number and 
density of hubs makes skewness correction important in high dimensions. 

For the changed interval alternative, the results are similar, with details 
in Appendix C. 



Table 1 

Critical values for the single change-point scan statistic based on MST at 0. 05 
significance level, n — 1000. 





Critical Values 


Graph 




no = 100 


no = 50 


no = 25 








Al 


A2 


Per 


Al 


A2 


Per 


Al 


A2 


Per 




D 


d= 1 


2.98 


3.05 


3.04 


3.08 


3.22 


3.23 


3.14 


3.39 


3.49 


4994 


2 




2.92 


2.90 


2.90 


3.00 


2.95 


2.95 


3.05 


2.98 


2.96 


5430 


8 


N(0,1) 


2.92 


2.89 


2.89 


3.00 


2.95 


2.92 


3.05 


2.97 


2.95 


5438 


7 


d = 10 


2.92 


2.90 


2.87 


3.00 


2.95 


2.94 


3.05 


2.98 


2.96 


5394 


7 




2.92 


2.89 


2.86 


3.00 


2.94 


2.90 


3.05 


2.97 


2.92 


5534 


8 




2.92 


2.89 


2.89 


3.00 


2.95 


2.92 


3.05 


2.97 


2.95 


5460 


7 




2.93 


2.91 


2.89 


3.01 


2.97 


2.96 


3.06 


3.00 


2.97 


5064 


7 


Exp(l) 


2.93 


2.91 


2.88 


3.01 


2.97 


2.92 


3.06 


3.00 


2.95 


5082 


7 


d = 10 


2.93 


2.91 


2.91 


3.01 


2.98 


2.97 


3.06 


3.01 


3.00 


5028 


5 




2.93 


2.91 


2.87 


3.01 


2.98 


2.93 


3.06 


3.01 


2.97 


5028 


6 




2.93 


2.91 


2.88 


3.01 


2.96 


2.92 


3.06 


2.98 


2.94 


5180 


9 




2.86 


2.69 


2.68 


2.94 


2.70 


2.68 


3.00 


2.70 


2.68 


12454 


38 


N(0,1) 


2.86 


2.72 


2.72 


2.95 


2.74 


2.72 


3.00 


2.74 


2.72 


10904 


38 


d = 100 


2.86 


2.70 


2.66 


2.94 


2.71 


2.66 


3.00 


2.71 


2.66 


11294 


42 




2.87 


2.72 


2.68 


2.95 


2.74 


2.68 


3.00 


2.74 


2.68 


10690 


40 




2.86 


2.69 


2.65 


2.94 


2.70 


2.65 


3.00 


2.70 


2.65 


11722 


40 




2.85 


2.64 


2.60 


2.93 


2.65 


2.60 


2.99 


2.65 


2.60 


14706 


56 


Exp(l) 


2.87 


2.77 


2.76 


2.95 


2.80 


2.77 


3.01 


2.81 


2.77 


9608 


25 


d = 100 


2.84 


2.62 


2.53 


2.93 


2.62 


2.53 


2.99 


2.62 


2.53 


15536 


77 




2.86 


2.74 


2.69 


2.95 


2.76 


2.69 


3.00 


2.76 


2.69 


10890 


30 




2.86 


2.72 


2.66 


2.94 


2.73 


2.66 


3.00 


2.73 


2.66 


12018 


39 



4. Power Comparisons. We used simulations to compare the power 
of the graph-based scan statistics to parametric approaches. In the first 
simulation set-up, we generated a sequence of 200 observations from the 
following model: 



N{0,ld), t = l,...,100; 
iV(/i,S), t = 101,..., 200. 

As before, d is the dimension of each observation. There is a change-point at 
100. The mean ^ of the second half of the data is shifted from by amount 
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Table 2 

Critical values for the single change-point scan statistic based on MST at 0.01 
significance level, n = 1000. 





Critical Values 


Graph 




no = 100 


no = 50 


no = 25 








Al 


A2 


Per 


Al 


A2 


Per 


Al 


A2 


Per 




D 


d= 1 


3.52 


3.62 


3.67 


3.60 


3.81 


3.85 


3.65 


4.05 


4.31 


4994 


2 




3.47 


3.43 


3.46 


3.53 


3.46 


3.48 


3.57 


3.48 


3.48 


5430 


8 


N(0,1) 


3.47 


3.43 


3.44 


3.53 


3.46 


3.46 


3.57 


3.47 


3.46 


5438 


7 


d= 10 


3.47 


3.43 


3.44 


3.53 


3.46 


3.47 


3.58 


3.48 


3.48 


5394 


7 




3.47 


3.42 


3.38 


3.53 


3.46 


3.40 


3.57 


3.47 


3.41 


5534 


8 




3.47 


3.43 


3.44 


3.53 


3.46 


3.46 


3.57 


3.47 


3.46 


5460 


7 




3.48 


3.45 


3.40 


3.54 


3.49 


3.44 


3.58 


3.50 


3.45 


5064 


7 


Exp(l) 


3.48 


3.44 


3.40 


3.54 


3.48 


3.42 


3.58 


3.50 


3.44 


5082 


7 


d= 10 


3.48 


3.45 


3.47 


3.54 


3.49 


3.49 


3.58 


3.51 


3.52 


5028 


5 




3.48 


3.45 


3.41 


3.54 


3.49 


3.44 


3.58 


3.51 


3.46 


5028 


6 




3.48 


3.44 


3.49 


3.54 


3.47 


3.53 


3.58 


3.48 


3.54 


5180 


9 




3.42 


3.17 


3.19 


3.48 


3.17 


3.19 


3.53 


3.17 


3.19 


12454 


38 


N(0,1) 


3.42 


3.21 


3.24 


3.49 


3.21 


3.24 


3.53 


3.21 


3.24 


10904 


38 


d= 100 


3.42 


3.19 


3.17 


3.49 


3.19 


3.17 


3.53 


3.19 


3.17 


11294 


42 




3.42 


3.22 


3.18 


3.49 


3.22 


3.18 


3.53 


3.22 


3.18 


10690 


40 




3.42 


3.18 


3.21 


3.49 


3.18 


3.21 


3.53 


3.18 


3.21 


11722 


40 




3.41 


3.14 


3.12 


3.48 


3.14 


3.12 


3.52 


3.14 


3.12 


14706 


56 


Exp(l) 


3.43 


3.28 


3.26 


3.49 


3.28 


3.26 


3.54 


3.28 


3.26 


9608 


25 


d= 100 


3.41 


3.15 


3.10 


3.48 


3.15 


3.10 


3.52 


3.15 


3.10 


15536 


77 




3.42 


3.24 


3.21 


3.49 


3.24 


3.21 


3.53 


3.24 


3.21 


10890 


30 




3.42 


3.22 


3.13 


3.48 


3.22 


3.13 


3.53 


3.22 


3.13 


12018 


39 



Table 3 

Critical values for the single change-point scan statistic based on MDP. n = 1000. 



significance level = 0.05 







d=l 


d= 10 


d= 100 


no 


Al A2 


N(0,1) Exp(l) 


N(0,1) Exp(l) 


N(0,1) Exp(l) 


200 


2.82 2.84 


2.83 2.81 


2.85 2.85 


2.85 2.83 


100 


2.98 3.07 


3.06 3.04 


3.08 3.08 


3.07 3.05 


50 


3.08 3.27 


3.30 3.29 


3.35 3.36 


3.35 3.31 


25 


3.14 3.48 


3.54 3.58 


3.57 3.66 


3.60 3.60 



significance level = 0.01 







d=l 


d = 10 


d= 100 


no 


Al A2 


N(0,1) Exp(l) 


N(0,1) Exp(l) 


N(0,1) Exp(l) 


200 


3.38 3.43 


3.39 3.38 


3.44 3.46 


3.45 3.44 


100 


3.52 3.66 


3.66 3.64 


3.67 3.75 


3.67 3.59 


50 


3.60 3.90 


3.99 3.99 


3.94 4.05 


3.95 3.99 


25 


3.65 4.21 


4.61 4.65 


4.78 4.72 


4.59 4.81 
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Table 4 

Critical values for the single change-point scan statistic based on NNG at 0. 05 
significance level, n = 1000. 





Critical Values 


Graph 




no = 100 


no = 50 


no = 25 








Al 


A2 


Per 


Al 


A2 


Per 


Al 


A2 


Per 


D 




2.96 


2.98 


2.95 


3.04 


3.07 


3.03 


3.10 


3.13 


3.08 


2008 


2 


N(0,1) 


2.96 


2.98 


2.97 


3.05 


3.07 


3.05 


3.10 


3.13 


3.09 


1972 


2 


d = l 


2.96 


2.98 


3.01 


3.04 


3.07 


3.10 


3.10 


3.12 


3.13 


2032 


2 




2.96 


2.98 


2.97 


3.04 


3.07 


3.04 


3.10 


3.13 


3.09 


2008 


2 




2.96 


2.99 


3.01 


3.05 


3.08 


3.10 


3.10 


3.13 


3.13 


1954 


2 




2.96 


2.99 


2.98 


3.05 


3.08 


3.08 


3.10 


3.13 


3.11 


1948 


2 


Exp(l) 


2.96 


2.98 


2.96 


3.04 


3.07 


3.07 


3.10 


3.13 


3.11 


2038 


2 


d=l 


2.96 


2.98 


2.96 


3.04 


3.08 


3.05 


3.10 


3.13 


3.09 


2014 


2 




2.96 


2.98 


2.99 


3.04 


3.07 


3.08 


3.10 


3.12 


3.13 


2008 


2 




2.96 


2.98 


2.99 


3.04 


3.08 


3.08 


3.10 


3.13 


3.13 


2038 


2 




2.94 


2.92 


2.89 


3.02 


2.97 


2.93 


3.07 


3.00 


2.96 


3370 


6 


N(0,1) 


2.94 


2.91 


2.90 


3.02 


2.97 


2.95 


3.07 


2.99 


2.96 


3502 


6 


d= 10 


2.94 


2.91 


2.89 


3.01 


2.96 


2.95 


3.06 


2.98 


2.96 


3444 


7 




2.94 


2.91 


2.91 


3.01 


2.96 


2.94 


3.06 


2.98 


2.96 


3436 


6 




2.94 


2.91 


2.88 


3.02 


2.97 


2.93 


3.07 


2.99 


2.94 


3330 


6 




2.94 


2.92 


2.91 


3.02 


2.98 


2.96 


3.07 


3.00 


2.98 


3144 


5 


Exp(l) 


2.94 


2.92 


2.92 


3.02 


2.98 


2.97 


3.07 


3.00 


2.99 


3096 


6 


d=10 


2.94 


2.92 


2.92 


3.02 


2.98 


2.98 


3.07 


3.01 


3.01 


3118 


6 




2.94 


2.93 


2.92 


3.02 


2.98 


2.97 


3.07 


3.01 


2.99 


3114 


5 




2.94 


2.92 


2.91 


3.02 


2.98 


2.98 


3.07 


3.01 


3.00 


3152 


6 




2.87 


2.65 


2.62 


2.95 


2.65 


2.62 


3.00 


2.65 


2.62 


9382 


52 


N(0,1) 


2.87 


2.73 


2.70 


2.95 


2.75 


2.71 


3.01 


2.76 


2.71 


8466 


24 


d= 100 


2.88 


2.76 


2.72 


2.96 


2.78 


2.72 


3.01 


2.79 


2.72 


7756 


20 




2.86 


2.59 


2.56 


2.94 


2.59 


2.56 


3.00 


2.59 


2.56 


11092 


68 




2.87 


2.68 


2.64 


2.95 


2.69 


2.64 


3.00 


2.69 


2.64 


9538 


38 




2.86 


2.71 


2.70 


2.95 


2.72 


2.70 


3.00 


2.73 


2.70 


10222 


34 


Exp(l) 


2.86 


2.72 


2.68 


2.95 


2.74 


2.69 


3.00 


2.74 


2.69 


10390 


37 


d= 100 


2.86 


2.70 


2.64 


2.94 


2.71 


2.64 


3.00 


2.71 


2.64 


11574 


35 




2.87 


2.74 


2.72 


2.95 


2.76 


2.73 


3.01 


2.77 


2.73 


8782 


22 




2.87 


2.73 


2.68 


2.95 


2.74 


2.68 


3.01 


2.74 


2.68 


8622 


41 
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Table 5 

Critical values for the single change-point scan statistic based on NNG at 0. 01 
significance level, n = 1000. 





Critical Values 


Graph 




no = 100 


no = 50 


no = 25 








Al 


A2 


Per 


Al 


A2 


Per 


Al 


A2 


Per 


D 




3.50 


3.53 


3.53 


3.57 


3.61 


3.59 


3.61 


3.65 


3.63 


2008 


2 


N(0,1) 


3.50 


3.54 


3.52 


3.57 


3.61 


3.63 


3.61 


3.65 


3.66 


1972 


2 


d = 1 


3.50 


3.53 


3.58 


3.57 


3.61 


3.66 


3.61 


3.65 


3.71 


2032 


2 




3.50 


3.53 


3.56 


3.57 


3.61 


3.63 


3.61 


3.65 


3.68 


2008 


2 




3.50 


3.54 


3.53 


3.57 


3.62 


3.64 


3.61 


3.66 


3.65 


1954 


2 




3.50 


3.54 


3.50 


3.57 


3.62 


3.61 


3.61 


3.66 


3.64 


1948 


2 


Exp(l) 


3.50 


3.53 


3.57 


3.57 


3.61 


3.63 


3.61 


3.65 


3.65 


2038 


2 


d=l 


3.50 


3.54 


3.52 


3.57 


3.61 


3.63 


3.61 


3.66 


3.66 


2014 


2 




3.50 


3.53 


3.60 


3.57 


3.61 


3.66 


3.61 


3.65 


3.71 


2008 


2 




3.50 


3.54 


3.54 


3.57 


3.62 


3.58 


3.61 


3.66 


3.66 


2038 


2 




3.48 


3.45 


3.46 


3.55 


3.49 


3.48 


3.59 


3.50 


3.49 


3370 


6 


N(0,1) 


3.48 


3.44 


3.47 


3.54 


3.48 


3.48 


3.59 


3.49 


3.48 


3502 


6 


d= 10 


3.48 


3.44 


3.42 


3.54 


3.47 


3.45 


3.58 


3.48 


3.46 


3444 


7 




3.48 


3.44 


3.43 


3.54 


3.47 


3.46 


3.59 


3.48 


3.47 


3436 


6 




3.48 


3.44 


3.44 


3.55 


3.48 


3.48 


3.59 


3.49 


3.48 


3330 


6 




3.49 


3.45 


3.46 


3.55 


3.49 


3.51 


3.59 


3.50 


3.51 


3144 


5 


Exp(l) 


3.49 


3.45 


3.48 


3.55 


3.49 


3.52 


3.59 


3.50 


3.52 


3096 


6 


d= 10 


3.49 


3.46 


3.48 


3.55 


3.49 


3.54 


3.59 


3.51 


3.57 


3118 


6 




3.49 


3.46 


3.41 


3.55 


3.50 


3.46 


3.59 


3.51 


3.46 


3114 


5 




3.49 


3.46 


3.49 


3.55 


3.49 


3.52 


3.59 


3.51 


3.53 


3152 


6 




3.42 


3.13 


3.07 


3.49 


3.13 


3.07 


3.54 


3.13 


3.07 


9382 


52 


N(0,1) 


3.43 


3.21 


3.19 


3.50 


3.21 


3.19 


3.54 


3.21 


3.19 


8466 


24 


d= 100 


3.44 


3.25 


3.23 


3.50 


3.25 


3.23 


3.54 


3.25 


3.23 


7756 


20 




3.42 


3.09 


3.08 


3.48 


3.09 


3.08 


3.53 


3.09 


3.08 


11092 


68 




3.42 


3.16 


3.16 


3.49 


3.16 


3.16 


3.54 


3.16 


3.16 


9538 


38 




3.42 


3.20 


3.19 


3.49 


3.20 


3.19 


3.53 


3.20 


3.19 


10222 


34 


Exp(l) 


3.42 


3.22 


3.21 


3.49 


3.22 


3.21 


3.53 


3.22 


3.21 


10390 


37 


d= 100 


3.42 


3.18 


3.17 


3.48 


3.18 


3.17 


3.53 


3.18 


3.17 


11574 


35 




3.43 


3.23 


3.23 


3.49 


3.23 


3.23 


3.54 


3.23 


3.23 


8782 


22 




3.43 


3.22 


3.24 


3.50 


3.22 


3.24 


3.54 


3.22 


3.24 


8622 


41 
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A in Euclidean distance. We considered cases where the covariance matrix 
remains constant (S = Id), as well as cases where the covariance matrix 
also changes. When the covariance matrix changes, we set S to a diagonal 
matrix with 1] = d^^^ and = 1 for i = 2, . . . ,d. We chose A for 

each value of d so that most methods have moderate power. 

Hotelling's is a parametric test designed specifically for detecting a 
change in multivariate normal mean when there is no change in variance. 
When there is a change in both mean and variance, the generalized likelihood 
ratio test (GLR) can be used. We compare the graph-based scan statistics 
to scan statistics based on these two existing methods. For any candidate 
change-point t, the Hotelling's is 

n 

where 

t n 

yt = ^yilt, ft = ^ yi/in-t), 

i=l i=t+l 
' t 

S = (n - 2)-^ 

.i=l i=t+l 



^{yi - yt){yi - ytf + ^{yi- yt){yi - y?)^ 



The GLR is 

GLR{t) = nlog|S„| -t\og\tt\ - (n - t) \og\t*\ 



where 



E*=i(yi - yt){yi - ytf _ E"=t+i(yj - yDiyi - ytV 



t ' n-t 

Some constraints apply to T^(t) and GLRit). For the number of observa- 
tions n needs to be larger than the dimension of the data d so that S can be 
inverted. For GLR, both t and n — t need to be larger than the dimension of 
the data so that the determinants of S^, SJ' are not zero. Thus, when d < 20, 
we set no = d + 10 and ni = n — uq. When d > 20, we set no = 50 and 
ni = 150. (An exception for GLR is that when d = 50, no and ni are set to 
60 and 140, respectively, so that the test statistic can be calculated.) 

Table 6 shows the power comparisons. Scan statistics based on the three 
ways of constructing the graph - MST, MDP and NNG - using Euclidean 
distance are compared to scan statistics based on maximization of T^{t) 
and GLR{t). First, compare the graph-based methods to Hotelling's T^: 
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When the variance does not change, outperforms ah other methods in 
low to moderate dimension [d < 175). This is expected, as was designed 
specifically for this scenario. Remarkably, graph-based methods surpass 
at its own game when dimension is high (d = 175). Now, consider the case 
where the variance also changes. By assuming an incorrect alternative, the 
power of is quickly surpassed by graph-based methods, for d as low as 5. 

Comparing graph-based methods to GLR-based scan statistic, we see a 
similar pattern: When dimension is low (d = 1,5,10), GLR-based scans 
dominate in power when both the mean and variance changes. Graph-based 
methods exceed GLR in power when d increases, already performing much 
better by d = 20, which is considered quite moderate in today's applications. 
The low power of GLR at even moderate dimension is due to its requirement 
that the covariance matrix be estimated for both segments. 

We also considered a case where the normality assumption is violated by 
generating data from the log- normal distribution (S = 1^)- Then, graph- 
based methods outperform by d = 75, and GLR hy d = 5. 

Comparing among the graph-based scan statistics, we see that MST and 
NNG have comparable power, and dominate MDP in all situations. An ex- 
planation is that, of these three ways of constructing graphs, the MDP re- 
tains the least information from the data, having half as much edges as the 
other two graphs. The fact that MST and NNG have similar power in all 
scenarios suggests that the graph-based method is not very sensitive to the 
method of graph construction. 

5. Real Data Examples. We illustrate the new approach on two dif- 
ferent applications. The first is a longitudinal study of a network through 
time. The second is a statistical analysis of the text of Tirant lo Blanc. 

5.1. Friendship Network. The MIT Media Laboratory conducted a study 
following 90 subjects, consisting of students and staff at the university, using 
mobile phones with pre-installed software recording call logs from July 2004 
to June 2005 (Eagle, Pentland and Lazer, 2009). In this analysis, we extract 
the information on the caller, callee and time for every call that was made 
during the study period. The question of interest is whether phone call 
patterns changed during this time, which may reflect a change in relationship 
among these subjects. We bin the calls by day and, for each day, construct 
a network with the 90 subjects as nodes and a link between two subjects 
if they had at lease one call on that day. We encode the network of each 
day by an adjacency matrix, with 1 for element if there is an edge 
between subject i and subject j, and otherwise. Thus, the processed data 
are adjacency matrices, one for each day from 2004/7/20 to 2005/6/14. 
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Table 6 

Number of simulated sequences (out of 100) with significance less than 



Normal data, E = 7 



d 


1 


5 10 


20 


50 


75 


100 


125 


150 


175 


A 


0.5 


0.65 0.8 


0.8 


1 


1.2 


1.2 


1.4 


1.6 


2 


T2 


81 


85 98 


76 


82 


90 


74 


72 


67 


46 


GLR 


72 


51 28 


8 


8 












MST 


14 


20 18 


15 


29 


45 


34 


42 


47 


73 


MDP 


5 


7 12 


8 


19 


23 


15 


18 


27 


43 


NNG 


8 


16 16 


16 


33 


49 


37 


46 


48 


77 


Normal data, 


S is diagonal with E[l, 1] 


= d 




i\ = 


l,i = 2. 


..,d. 






d 


1 


5 


10 


20 












A 


0.5 


0.4 


0.1 


0.2 












T2 


80 


18 


8 


3 












GLR 


76 


80 


67 


31 












MST 


8 


27 


35 


65 












MDP 


9 


18 


22 


30 












NNG 


10 


17 


34 


59 









Log-normal data, S = 7. 



d 


1 


5 


10 


20 


50 


75 


100 


A 


0.7 


0.9 


1 


1 


1.2 


1.4 


1.4 


T2 


83 


77 


79 


58 


60 


43 


29 


GLR 


28 


21 


18 


12 


7 






MST 


18 


35 


47 


28 


31 


62 


71 


MDP 


7 


15 


27 


14 


27 


22 


25 




i!) 


.31 


:!!) 


28 


;]i 


58 


71 
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We show results for graphs constructed using two different dissimilarity 
measures. Let Ai be the 90 by 90 adjacency matrix on day i. We denote Vi 
to be the vector form of Ai. The dissimilarities are: 

(1) the number of different edges: \\vi — Vj\\i = \\vi — Vj\\2, 

(2) the number of different edges, normalized by the geometric mean of the 
total for each day: J^^=Sdk=. 

Results based on different dissimilarities and different ways of construct- 
ing the graph are shown in Figure 3. We see that statistics based on MST 
and NNG give similar results under both dissimilarities. The statistic based 
on MDP is not informative here, possibly because MDP is not dense enough 
to capture the information in the data. Based on the scans using MST and 
NNG, a change-point occurred at around 2005/1/9, which turns out to be 
the winter break for that year at MIT. The p-values for the scan based on 
MST and NNG under both dissimilarity measures are all < 0.0001, by both 
10,000 permutations and analytic approximation (3.9). Perhaps a change 
in courses after winter break changed the social organization among the 
subjects. 

5.2. Authorship Debate. Tirant lo Blanc, a chivalry novel published in 
1490, is considered to be one of the best known medieval works of literature 
in Catalan. The dedicatory letter at the beginning of the book states, 

... So that no one else can be blamed if any faults are found in this work, I, 
Joanot Martorell, knight, take sole responsibility for it, as I have carried out 
the task singlehandedly... 

However, the colophon at the end of the book states something different, 

... by the magnificent and virtuous knight, Sir Joanot Martorell, who because 
of his death, could only finish writing three parts of it. The fourth part, which 
is the end of the book, was written by the illustrious knight Sir Marti Joan de 
Galba. If faults are found in that part, let them be attributed to his ignorance... 

This inconsistency sparked a debate, still ongoing, about the authorship 
of Tirant lo Blanc since its publication. Opinions have mainly fallen into 
two camps, one favoring the single authorship by Joanot Martorell and the 
other favoring a change of author somewhere between chapters 350 and 400. 
(There are in total 487 chapters in the book.) One objective way to settle 
this debate is through the statistical analysis of word usage, which reflects 
the unique style of writing of different people. 

Giron, Ginebra and Riba (2005) analyzed two sets of word usage statistics 
extracted from the book. The first, which we call the word length data set, 
categorizes the words in each chapter by its length, with a single category for 
all words with length greater than nine letters. Thus, this data set represents 
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MST, t=174 



MDP, t=197 



NNG,t=174 




50 100 200 300 
Day 

MST, t=175 



50 100 200 300 
Day 

MDP, t=288 



50 100 200 300 
Day 

NNG, t=175 




50 100 200 300 
Day 



50 100 200 300 
Day 



50 100 200 300 
Day 



Fig 3. Results of graph-based scans of the MIT phone call network. Top row shows results 
from using number of different edges as the dissimilarity measure and bottom row shows 
results from using the normalized number of different edges. The three columns show three 
different ways of constructing the graph: MST, MDP, and NNG from left to right. In each 
plot, Z(t) is plotted along t. The estimated change-point is shown m the caption above the 
plot. The two vertical lines show no and ni; we basically excluded the first 5% and the last 
5% of the points. The horizontal hnes show critical values at 0.05 and 0.01 significance 
levels, with the solid lines showing critical values computed from 10,000 permutations and 
the dashed lines showing those computed from the analytic approximation after skewness 
correction. 
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each chapter by a vector of length 10. The second, which we call the context- 
free word frequency data set, counts the occurrence of the 25 most frequent 
context-free words in each chapter. Giron, Ginebra and Riba (2005) analyzed 
the two data sets using a Bayesian multinomial change-point model and a 
Bayesian clustering method, and concluded in favor of the change of author 
hypotheses with the estimated change-point between chapters 371 and 382. 

Here, we apply the graph based change-point method to the two data 
sets, treating each chapter as a time-point. There are in total 487 chapters, 
and we use the 425 chapters that have more than 200 words. For both 
data sets, we normalized the count vector for each chapter by dividing by 
the total number of words in the chapter. Thus, our data is a sequence 
of 425 normalized frequencies, of dimension 10 for the word length data 
and dimension 25 for the context-free word frequency data. The L2 norm is 
used to construct the MST, MDP and NNG graphs representing similarity 
between chapters. Z{t) and the estimated change-points, computed for each 
type of graph, are shown in Figure 4. Test results using the three different 
graphs and the two data sets support the change of author hypothesis, with 
the estimated change-point around chapter 360, which is consistent with the 
view that there is a change of author somewhere between chapters 350 and 
400. The p- values are shown in Table 7. 

Table 7 

P -values for the tests. In each cell, the first value is calculated from 10,000 permutations 
and the second value is calculated from the analytic approximation after skewness 



correction. 


data 


MST 


MDP 


NNG 


word length 


0.0000/0.0000 


0.0041/0.0018 


0.0000/0.0000 


context-free word frequency 


0.0000/0.0000 


0.0000/0.0000 


0.0000/0.0000 



To check the robustness of our analysis, we also applied the scan on data 
for the first 350 chapters to see if it rejects the null there. Opinions seem 
to be quite uniform that the first 350 chapters were all written by Joanot 
Martorell. The results are shown in Figure 5. The word length data does not 
reject the null for the 350 chapters at 0.05 significance level. However, the 
context-free word frequency data supports a change-point, although different 
graphs favor diff'erent locations for the change-point. The p- values of the tests 
are shown in Table 8. These results suggest that word length may be more 
robust than context-free word frequency in refiecting writing styles. 

6. Conclusions and Discussion. We have described a new nonpara- 
metric method for change-point detection that can be applied to high di- 
mensional and non-Euclidean data. The method requires only the existence 
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MST, t=320/371 MDP, t=31 4/365 NNG, t=320/371 




Index Index Index 



MST, t=306/356 MDP, t=31 3/364 NNG, t=306/356 




100 200 300 400 100 200 300 400 100 200 300 400 



Index Index Index 

Fig 4. Results of graph-based scans of chapter-wise word usage frequencies of Tirant lo 
Blanc. The first row shows results from the word length data and the second row shows 
results from the context-free word frequency data. The three columns show scans based on 
three different graphs: MST, MDP, and NNG from left to right. The content in each plot is 
the same as in Figure 3. In the caption for each plot, the estimated change-point is shown 
in the form A/B, where A is the index of the change-point within the 425 chapters used 
for analysis, and B is the chapter number in the novel. 



Table 8 

P-values for the tests only using data from the first 350 chapters. Numbers in each cell 



have the 


same meaning as 


in Table 7. 




data 


MST 


MDP 


NNG 


word length 


0.0538/0.0562 


0.1061/0.1040 


0.3086/0.3527 


context-free word frequency 


0.0000/0.0000 


0.0019/0.0009 


0.0000/0.0000 
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MST, t=140/169 




MST, t=159/191 




50 100 150 200 250 300 



MDP, t=78/100 




50 100 150 200 250 300 
Index 



MDP, t=92/113 




50 100 150 200 250 300 
Index 



NNG,t=140/169 




50 100 150 200 250 300 
Index 



NNG,t=148/177 



50 100 150 200 250 300 



Fig 5. Results from the first 350 chapters. The setting of this figure is the same as in 
Figure 4- 
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of a dissimilarity measure on the sample space. In applications, the choice 
of a good dissimilarity measure is critical, and domain knowledge should be 
used to design a measure that is sensitive to the signal of interest. The ap- 
proach we propose decouples this application-specific choice of dissimilarity 
measure from the formal test for a change-point. Graph-based scan statistics 
are easy to compute, and the analytic p- value approximations are generally 
applicable. 

We have shown that the p-value approximations are quite accurate. Our 
simulations were for a data sequence of length n = 1000. The accuracy 
of the approximations depend on uq (Iq for the changed interval alterna- 
tive) and not so much on n. Accuracy also depends on the structure of 
the graph. When the graph is highly star-shaped, which is common for 
high-dimensional data when the Euclidean distance is used in construct- 
ing the graph, the skewness correction is critical for the approximations to 
be accurate. For extremely star-shaped graphs, we imagine that adjusting 
for kurtosis and higher order moments might also be helpful. The strategy 
would be similar to skewness correction, but more technically complicated. 
We don't compute these higher order terms in this paper, but if needed they 
can be computed in a similar fashion as the skewness term with the aid of 
a symbolic computation software. 

The main reason that higher order corrections are necessary in high di- 
mensions is the increase in size and density of hubs in the graph, as shown 
in Section 3.3. If hubs dominate the topology of the graph, perturbation of 
any hub can change the topology drastically. Furthermore, R{t), which does 
not take into account the interaction between edges, loses all information 
regarding the high order structure. Under such circumstances, the particular 
graph would not be useful for separating Fi from Fq, and we would suggest 
exploring other dissimilarity measures and graph construction methods. 

Compared to parametric approaches, the graph-based approach requires 
much less assumptions, but also makes less use of the data. Although this 
leads to loss of power in low dimensions if the data indeed follow the para- 
metric model, it leads to robustness and wider applicability. An important 
observation is that the graph-based approach has desirable power, compared 
to standard parametric tests, in moderate and high dimensions. For high di- 
mensional data, it is often hard to predict the direction and nature of the 
change. Without such prior knowledge, parametric models would require the 
estimation of many parameters, most of which would be unrelated to the 
change. For example, the Hotelling statistic requires the estimation of 
the large covariance matrix. If, by prior knowledge or data pre-processing, 
we can circumvent the covariance estimation, then Hotelling would be 
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preferable when the data satisfies its assumptions - normahty with no change 
of variance. Otherwise, graph-based approaches gain increasing advantage 
over Hotelhng's as d increases, even in the problem for which Hotelling's 
was explicitly designed. 

We mainly explored three different ways of constructing the underlying 
graph given a dissimilarity measure. From the numerical results and the 
analysis of the MIT cell phone network, we see that scans based on MST and 
NNG perform similarly, while scans based on MDP have lower power. We 
suspect this is due to the fact that MDP is the least dense graph and utilizes 
the least amount of information from the original data set. In this regard, 
one may try denser graphs which retain more information from the data 
than the MST and NNB. One may even consider assigning weights to the 
edges. As in all problems, building more assumptions into the statistic leads 
to improved power if the assumptions are true, but sacrifices robustness. 

If more than one change-point or changed interval were of interest, the 
graph-based scan can be applied recursively in a procedure that is called bi- 
nary or circular binary segmentation (Vostrikova, 1981; Olshen et al., 2004). 
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APPENDIX A: PROOFS FOR LEMMAS AND PROPOSITIONS 

A.l. Proof for Lemma 2.1. The formula for the expectation is im- 
mediate, 

nm]= Yl pi9iit)^9,{t))=p^it)\£G\. 

For the second moment, 

m\t)] = Yl ^(9iit) + 9jit),gkit) + giit)). 

{i,j),(k,l)&£G 

Since 



F[gi{t)^gj{t),gk{t)^gi{t)] 
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we have 

72 



2t{n-t) 
n(n— 1) 



t{n-t) 
n(n— 1) 



U{t-l){n-t){n-t-l) 
n(n-l)(n-2)(n-3) 



E[i?2(t)]= pi{t)+ E 



i = k,j = I 
i = l,j = k 
i = k,j y^l 

j = k,i^l 
. J = /,i / A; 
: P2(^) if hj,k,l all different, 



if 



if < 



E 



P2(i) 



P2(t)|^G| + 



i,j,k,l all different 



2^1(0 -P2(t) 



^|f,p+p2(t)|£:Gp. 



Var[i2(t)] follows from B[R'^{t)] - E'^[R{t)]. 

A. 2. Proof of Proposition 3.1. The proof is a straightforward ap- 
plication of the method of Woodroofe (1976, 1978), which had been sub- 
sequently refined and applied to many settings. See Chan, Tu and Zhang 
(2009) for a review of this and related methods. We sketch the derivation 
below. 

P( max Z{t) > b) 

no<t<ni 

(A.l) 

= E 

no<t<ni 



PiZit) = b + dx)P{ max Z{s) < b\Z{t) = b + dx) 

no<s<t<ni 



(A.2) 

b 



2-K 



P( max {Z{s) - Z{t)) < -^\Z{t) = b+ ^)dx 

n(i<s<t<ni b b 



no<t<ni 



e-^P( max b{Z{s) - Z{t)) < -x\Z{t) = b)dx. 

nci<s<t<ni 



The last approximation is valid for 6—7-00. 

We rescale time and let v = s/n, u = t/n. pu{5) = Cor [Z(n(ti — 
5)),Z(nii)], so Cor[Z(s), Z(t)] = Pu{u — v). By first order Taylor approxi- 
mation of Pu{u — v) sX V = we have, for v close to n, 

b[Z{nv)-Z{nu)\\Z{nv) = b ~ N [-{u - v)p'^{{)-)b'^ ,2{u - v)p'^{<d-)b^) , 
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where Pu(0~) is defined in (3.6). The left- and right- derivatives are equal 
in absolute value and opposite in sign. Thus for k = . . . , —2, —1, 0, 1, 2, ... , 
conditioned on Z{t), for large n, the process b[Z{s + k) — Z{t)] behaves like a 
random walk with drift —\k\p'^{0~)b'^ and variance twice the negative drift. 
One can show that, for b = bo^/ri and n oo, 

lim limsup V > b\Z{t) = b + dx) = 0. 



fc->oo n— >-oo 



\i-t\>k 



Thus, let Wi*^ be a random walk with V^f ^ ~ N{n^^\ (cr^*))^), where 
p[{0~)b'^, (cT^*^)^ = 2fi^^\ Under the conditions for Proposition 3.1, 



max^ b{Z{s) - Z{t)) < -x\Z{t) = 6 ) ~ P ( max -W/!^, < -x 



P 

(A.3) 



no<.s<t<.ni 



no<s<t 



- P ( minWi*) > 



m>l 



(3.3) follows by combining (A. 2) and (A.3), and the result 

POO 

/ e"^/p(minVFm > x)dx = fiu{2fj,/a) 
Jo ™>i 

for discretized standard Brownian motion Wm (see Siegmund (1992)) . 
Now we calculate p^(0~). Note that for s < t, we have 



P[gi{s)^g,{s),gk{t)^gi{t)] 

2s(n-t) 
n{n—l) 

s{n-t){n+2t-2s-2) 



2s(n—t) ( ,\ 



n(n-l)(n-2) 921,5, Ij 



4s(n-f)[(g-l)(n-s-l)+(i-5)(n-s-2)] 
n(n-l)(n-2)(n-3) 



k,i = 1 
l,j = k 

- k,i 

qs^s, t) if k, I all different. 



if 



if < 



Thus, 



E[Ris)R{t)] = Yl ^i9^{s) + gj{s),gk{t) + gm\ 

{i,j),ik,i)e£G 

ii,j),ik,l)G£G 
i,j,k,l all different 
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{qi{s,t) - 2q2{s,t) + q3{s,t))\£G\ + {q2{s,t) - q3{s,t))Y,\£i\^ + Q3{s,t)\£G\ 



i=l 



and 



Puiu -v) = Cor{Q{nv), Q{nu)) = Cor(Z(s), Z{t)) 
_ E[R{s)R{t)]-B[R{s)]B[R{t)] 
^Yar[R{s)]Yar[R{t)] 

^ {qi{s,t) - 2q2{s,t)+q3{s,t))\£G\ + {q2{s,t) - q3{s,t))Ztl l^il^ 

^Var(i?(s))Var(i?(t)) 
, iq3is,t)-pi{s)pim£G\^ 
y^Var{R{s))Var{R{t)) 

After replacing s,t by nv,nu, respectively, d-Pu{0) can be obtained. 

A. 3. Proof of Proposition 3.2. The proof of this proposition uses the 
method of Siegniund (1988, 1992). We omit most of the technical details, 
which follow these two papers given that ^2) (defined below) is 

differentiable with the derivative being continuous except at (5i = and at 
^2 = 0. 

Let ui = ti/n,U2 = t2/n, 

P{ui,u2)i^i^^2) = Cor(Z(n(ui - 6i),n{u2 - 62)), Z{nui,nu2)). 
A key intermediate form of the approximation is 



P 

(A.4) 



max Z{ti, 12) > b 



Ciih,t2)b^C2ih,t2)b^ 



X u 



y2Cl{h,t2Wj V [V'iC2{tl,t2Wj , 

where Ci, C2 are the partial derivatives 



Ci{nui,nu2) 



1 9-P(«i,«2)('^l,0) 



n 



d5i 



5i=0 



1^ d+P{m,u2){^i^^) 
n d6i 



5i=0 



C2{nui,nu2) 



1 '9+P(«i,«2)(0''^2) 



n 



dd2 



62=0 



Under the permutation null, the processes derived from perturbation of the 
left and right end points, 



Z{ti + k,t2), A: = ...,-2,-1,0,1,2,... 



34 



HAO CHEN AND NANCY ZHANG 



and ^ 

Z{h,t2-k), k = ...,-2,-1,0,1,2,..., 

are identical in distribution to the process 

Z{t2-ti-k), k = ...,-2,-1,0,1,2,..., 

where Z{t) is the standardized statistic for a one change-point scan (2.4). 
Thus, the partial derivatives are equal to the derivative in the one change- 
point scenario, 

Ci{ti,t2) = C2{h,t2)=np'^^_^^{0'). 

Substituting n/9^2-(ti(0 ) ^'1(^15^2) and C2{ti,t2) in (A. 4) and approxi- 
mating the double summation by an integral yields the final approximation. 

APPENDIX B: SKEWNESS CORRECTION 

B.l. Derivation of (3.9). In the analytic approximations (3.3) and 
(3.8), the marginal distribution of the process is assumed to be Gaussian. 
To correct for the skewness of the marginal distribution of the process, we 
adopt a similar method as that used by Tang and Siegmund (2001). In the 
derivation below we suppress the dependence on the time parameter t. 

Consider the probability measure dQg = e^^~'^^^')dP , where ip{0) = 
Ep[e^^]. Choose Ob such that ipiOb) = Eq^JZ] = b. Then, 

(B.l) 

P(Z e 6 + dx/b) = Ep[lzeb+d./b] « e-'^^'+-/'^^^^'^^Qe,{Z G b + dx/b). 

Since under Qg^^, Z is centered at b with variance i^{9b), Qeti^ G & + dx/b) 
can be approximated by the normal density, 

(B.2) Qe,{Zeb + dx/b)^ \ exp( ^ ) « ^^=. 

y2Tri;{9b) V ^b^i^i^b) J ^2^V>(06) 

The second approximation above is accurate for x/b — t- 0. 

To obtain tp{6b) and ipiOb), we use Taylor expansions, noting that ipiO) = 
^(0) = 0,V'(0) = 1,V^(0) = Ep[Z3] ^ 7: 

(B.3) m « m + me + m^-^ + m^-^ = y + y 

(B.4) ^(^) w V'(O) + Vi(0)^ = 1 +7^. 
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Combining (B.l), (B.2),(B.3) and (B.4) gives 

(B.5) P{Zeb + dx/b) « 1 ^-e,b-xe,/b+eiii+ye,/3)/2_ 

For an approximation of Of^, we solve ip{Oi,) by approximating ip up to the 
third order, 

(B.6) b = i,{e,) « V'(o) + mOb + = 0b + \iel 

yielding 

(B.7) « (-1 + + 27&)/7- 

Note that when 7 = 0, 0b = 6. (3.9) follows by using (B.5) in (A.l) in the 
proof of Proposition 3.1 and approximating the Obx/b term in the exponent 
by X. 

B.2. Proof for Lemma 3.1 and Remark 3.4. For the un-centered 
process R{t), 

E[i?'(t)] = ^^9i{t) / 93{t),gk{t) / gi{t),gu{t) / g,{t)). 

{i,j),{k,l),{u,v)£SG 

There are in total eight different configurations for three edges randomly 
chosen from the graph, and we derive P{gi{t) / gj{t),gk{t) / gi{t),gu{t) / 

9v{t)) = P3 separately for each of them. 

1) The three edges are actually the same edge. 

P3 = Pb.(.)#*«)) = ^. 

2) Two edges are the same and share one node with the third edge. 

Ps = Pig.it) + gjit),g.it) + g^ = 4^^- 

n\n — \) 

3) Two edges are the same and do not share any node with the third edge. 

4t(t - l)(n - t)(n - i - 1) 



P3 = P(ffi(t) y^5,(t),5fc(t) 



n(n — l)(n — 2)(n — 3) 
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4) The three edges share one node, and neither of them share the other 
node (star-shaped). 

P3 = P(<7iW + 9j{t),gi{t) / gk{t),gi{t) / gi{t)) 
_ tin - t)[{n - t - l)(n - t - 2) + {t - - 2)] 
~ n(n- l)(n-2)(n-3) ' 

5) One edge share one node with another edge and share the other node 
with the third edge. No node sharing between the second and the third 
edge (hnear chain). 

P3 = P(ft(i) ^ 9j{t),9iit) 7^ gk{t),9j{t) 7^ gi{t)) 
_ 2t{t-l){n-t){n-t-l) 
~ n(n-l)(n-2)(n-3) 

6) The three edges form a triangle. 

P3 = Fi9i{t) + 9j{t),gj{t) + gk{t),gk{t) + giif)) = 0. 

7) Two edges share one node, and share no node with the third edge. 

Pa = P{gi{t) gj{t),gi{t) 7^ gkit),gu{t) 7^ g^{t)) 
_ 2t{t - l){n - t){n - t - 1) 
~ n(n-l)(n-2)(n-3) 

8) No pair of the three edges share any node. 

P3 = P{g^{t) / gj{t),gk{t) + gi{t),gu{t) / g^{t)) 
_ 8t{t -l){t- 2)(n -t){n-t-l){n-t-2) 
n{n - l)(n - 2)(n - 3)(n - 4)(n - 5) ' 

The following lists the number of occurrences of each of the above cases: 

1) \£g\ 

2) 3EJ^d(l^d-l) 

3) 3\SGmG\-i)-3Ei\Simi\-'^) 

4) Z^\m£^\-m£^\-2) 

7) 3EJ^i|(l^.| - m + e^iij)eSa\{k ^ ihk),{j,k) G ^g}! - 
12E(,,)e^^(l^^.|-l)(l^^.|-l) 

8) \£GmG\-m£G\-2) + 6Ei^,J)e£a(\^^\-m£J\-^)-^Zi^J)e£a\i''^■ 

{i, k), U, k) e Sg}\ - Ei \£i\{\£i\ - m\£G\ - m\ - 2) 
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The lemma follows by summing up all of the probabilities as enumerated 
above. 

For MDP, only cases 1, 3, and 8 are possible, and the number of occur- 
rences of each case is 

1) \£G\=n 

3) 3|fG|(|'?G| -1) = 3n(n-l) 

8) I^gKI^gI - IWg\ - 2) = n(n - l)(n - 2) 

Remark 3.4 follows immediately by summing the probabilities of these 3 
cases. 

B.3. Effect of Skewness and Extrapolation at Boundaries. To 

gain a better understanding of the role of skewness, we explore the following 
quantities involved in the p-value approximations: 

. 7(t) = E[Z3(t)], 

Figure 6 shows the three quantities versus t for the single change-point scan 
statistic on a MDP graph when n = 1000, 6 = 3. Since the structure of MDP 
is always the same and does not depend on the distribution of y^, Figure 6 
is representative of all MDP graphs with n = 1000 subjects and threshold 
6 = 3. We can see from the figure that 7 is always larger than 0, indicating 
right skewness. For 7 = 0, = 6; for ^ > 0, Of, < b. Since Z{t) is right- 
skewed, the analytic approximation of the p-value assuming Gaussianity is 
smaller than the actual p-value, so the skewness correction should increase 
the p- value approximation. This is indeed true as S{t) is U-shaped with a 
minimum of 1. 

Each node in the MDP has degree 1. The shapes of 7(i) and 9b{t) for 
Z{t) computed on graphs with very low number of hubs are similar to their 
shapes for Z{t) computed on the MDP. For example, for data in low dimen- 
sions (< 5), scans based on MST and NNG constructed based on Euclidean 
distance have similar skewness properties as described above. However, as 
the dimension of the data increases, MST and NNG constructed based on 
Euclidean distance tend to become dominated by hubs, and the distribution 
of Z(t) becomes left-skewed. For a left-skewed distribution, 7 < 0, > 6, 
and 5 < 1. One problem for left-skewed distributions is that if 7 is smaller 
than —1/(26), the current approximation yields no solution for Of,. This issue 
is discussed in Remark 3.3 and here we provide a heuristic solution to this 
problem based on an extrapolation procedure. 
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We illustrate procedure through a MST constructed on a simulated 100- 
dimensional data based on Euclidean distance. From Figure 7, we see that 
Oi,{t) and S{t) are not defined except in the middle region. In this case, 
the integrand, nS{nu)b'^p'^{0~)u^y2b'^p'^{0~), is directly extrapolated to the 
edge regions using the boundary tangent at each side. If extrapolation is 
negative, it is set to zero. Figure 8 illustrates the integrand before and after 
extrapolation. 



Skewness theta_b S(t) 




Fig 6. The three quantities, 7, 9i, and S from left to right, for a MDP graph, n = 1000, b — 
3. 



Skewness theta_b S(t) 




Fig 7. The three quantities, 7, 9b and S from left to right, for a MST graph constructed us- 
ing Euclidean distance on a sequence of n = 1000 observations iid drawn from N[0,lun{). 
b = 3. 



APPENDIX C: CHECKING ANALYTIC APPROXIMATIONS TO 
P- VALUES FOR THE CHANGED INTERVAL 
ALTERNATIVE 



Tables 9-13 show the results of value approximations for the changed 
interval alternative. The notation and simulation settings are identical to 
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0.0 



0.2 



0.4 



0.6 



0.8 



1.0 



0.0 



0.2 



0.4 



0.6 



0.8 



1.0 



t/n 



t/n 



Fig 8. The integrand before (left) and after (right) extrapolation. The integrand can only 
be directly calculated in the middle part (t £ [248, 752]), and the outer part is obtained by 
extending using the boundary tangent. 

those for the single change-point alternative in Section 3.3, except that no 
is replaced by Iq for the smallest window size, (/i is set to n — Iq.) 

From the tables, conclusions similar to those for the single change-point 
alternative can be drawn. The analytic approximation after skewness correc- 
tion performs much better than the analytic approximation under Gaussian 
assumption, especially when dimension increases. The accuracy of skew- 
corrected approximation does not degrade significantly with dimension. It 
does well for MST- and NNG- based tests when the smallest window size 
to be considered is as small as 25 for both 0.05 and 0.01 significance levels, 
and for MDP-based test when the smallest window size is 50. 



We thank David Siegmund, Jerome Friedman, and Susan Holmes for help- 
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analysis of Tirant lo Blanc. 
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Table 9 

Critical values for the changed interval scan statistic based on MST at 0. 05 significance 

level, n = 1000. 





Critical Values 


Graph 




lo = 100 


/() = 50 


lo = 25 








Al 


A2 




Ai 


A2 


P(-r 


Al 


A2 


I'cr 




D 


d=l 


4.08 


4.29 


4.24 


4.22 


4.76 


4.73 


4.33 


5.44 


5.77 


4994 


2 




3.97 


3.89 


3.84 


4.07 


3.92 


3.89 


4.16 


3.93 


3.89 


5454 


8 


N(0,1) 


3.97 


3.91 


3.81 


4.07 


3.95 


3.85 


4.16 


3.97 


3.87 


5400 


7 


rf= 10 


3.97 


3.90 


3.81 


4.07 


3.93 


3.90 


4.16 


3.94 


3.91 


5448 


8 




3.97 


3.90 


3.91 


4.07 


3.94 


3.93 


4.16 


3.95 


3.94 


5440 


7 




3.97 


3.89 


3.82 


4.07 


3.91 


3.85 


4.15 


3.93 


3.85 


5524 


8 




3.99 


3.93 


3.86 


4.09 


3.97 


3.92 


4.17 


3.99 


3.95 


5042 


8 


Exp(l) 


3.99 


3.93 


3.84 


4.09 


3.96 


3.90 


4.17 


4.00 


3.92 


5040 


6 


d= 10 


3.99 


3.93 


3.85 


4.09 


3.97 


3.91 


4.17 


4.00 


3.93 


5106 


6 




3.99 


3.93 


3.82 


4.09 


3.97 


3.87 


4.17 


3.99 


3.91 


5042 


6 




3.99 


3.91 


3.94 


4.08 


3.95 


3.98 


4.17 


3.97 


3.98 


5126 


8 




3.87 


3.51 


3.52 


3.98 


3.51 


3.52 


4.09 


3.51 


3.52 


11600 


40 


N(0,1) 


3.86 


3.49 


3.55 


3.98 


3.49 


3.55 


4.08 


3.49 


3.55 


13346 


64 


d= 100 


3.88 


3.57 


3.66 


3.99 


3.57 


3.66 


4.09 


3.57 


3.66 


10422 


34 




3.88 


3.57 


3.58 


3.99 


3.57 


3.58 


4.09 


3.57 


3.58 


10804 


43 




3.88 


3.56 


3.58 


3.99 


3.56 


3.58 


4.09 


3.56 


3.58 


10862 


36 




3.88 


3.63 


3.59 


3.99 


3.63 


3.59 


4.09 


3.63 


3.59 


10384 


24 


Exp(l) 


3.87 


3.58 


3.49 


3.98 


3.58 


3.49 


4.09 


3.58 


3.49 


11922 


33 


d= 100 


3.88 


3.60 


3.63 


3.99 


3.60 


3.63 


4.09 


3.60 


3.63 


11194 


34 




3.89 


3.63 


3.55 


4.00 


3.63 


3.55 


4.10 


3.63 


3.55 


9680 


27 




:!.88 


:!.G2 


:',.()() 


:5.l)!) 


:5.()2 


:5.()() 


1.09 


:!.()2 




i()l()8 


29 
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Table 10 

Critical values for the changed interval scan statistic based on MST at 0. 01 significance 

level, n = 1000. 





Critical Values 


Graph 




lo = 100 


lo = 50 


lo = 25 








Al 


A2 


Per 


Al 


A2 


Per 


Al 


A2 


Per 


pi 


B 


d=l 


4.ol 


4. r 


4. M 


A dQ 

4. DO 


o.oi 


c on 


4. / Z 


O.Uo 


o.oo 




9 




4.4z 


4.32 


4.31 


4.50 


4.33 


A OO 


4.58 


A OO 


4.33 


5454 


8 


NfO 1) 


4.42 


4.34 


A OO 


A K 1 


A oa 
4.0D 


A OK 


A CO 


4.0 ( 


A OK 
4.Z0 


c; A nn 
04UU 


/ 


d=W 


4.42 


4.33 


4.20 


4.50 


4.51 


4.25 


4.58 


4.35 


4.29 


C /I /I o 

5440 


o 





A AO 




A ^fi 




A ^9 


A ^fi 




A ^fi 


A ^fl 


5440 


ft 

7 




4.42 


4.32 


4.31 


4.50 


4.33 


4.32 


4.57 


4.33 


4.32 




a 

o 




4.43 


4.36 


4.36 


4.52 


4.39 


4.36 


4.59 


4.39 


4.36 


5042 


8 


Exp(l) 


4.43 


4.36 


4.30 


4.52 


4.39 


4.36 


4.59 


4.40 


4.36 


5040 


6 


rf= 10 


4.43 


4.36 


4.32 


4.52 


4.39 


4.38 


4.59 


4.40 


4.44 


5106 


6 




4.43 


4.36 


4.27 


4.52 


4.39 


4.33 


4.59 


4.39 


4.33 


5042 


6 




4.43 


4.35 


4.35 


4.52 


4.37 


4.35 


4.59 


4.37 


4.35 


5126 


8 




4.34 


3.99 


4.28 


4.43 


3.99 


4.28 


4.52 


3.99 


4.28 


11600 


40 


N(0,1) 


4.33 


3.98 


3.95 


4.42 


3.98 


3.95 


4.51 


3.98 


3.95 


13346 


64 


d= 100 


4.34 


4.04 


4.12 


4.44 


4.04 


4.12 


4.52 


4.04 


4.12 


10422 


34 




4.34 


4.05 


4.22 


4.43 


4.05 


4.22 


4.52 


4.05 


4.22 


10804 


43 




4.34 


4.03 


4.00 


4.43 


4.03 


4.00 


4.52 


4.03 


4.00 


10862 


36 




4.34 


4.10 


3.95 


4.44 


4.10 


3.95 


4.52 


4.10 


3.95 


10384 


24 


Exp(l) 


4.33 


4.05 


3.87 


4.43 


4.05 


3.87 


4.52 


4.05 


3.87 


11922 


33 


d= 100 


4.34 


4.08 


4.14 


4.43 


4.08 


4.14 


4.52 


4.08 


4.14 


11194 


34 




4.35 


4.10 


3.86 


4.44 


4.10 


3.86 


4.53 


4.10 


3.86 


9680 


27 




4.34 


4.08 


4.10 


4.44 


4.08 


4.10 


4.52 


4.08 


4.10 


10468 


29 



Table 11 

Critical values for the changed interval scan statistic based on MDP. n = 1000. 



significance level = 0.05 







d=l 


d= 10 


d= 100 


lo 


Al A2 


N(0,1) Exp(l) 


N(0,1) Exp(l) 


N(0,1) Exp(l) 


100 


4.08 4.38 


4.39 4.46 


4.30 4.29 


4.32 4.32 


50 


4.22 4.97 


5.03 5.12 


5.10 4.87 


5.19 4.99 


25 


4.33 5.81 


6.31 6.32 


6.14 6.12 


6.60 6.35 



significance level = 0.01 







d = 1 


d = 10 


d = 100 


/„ 


Ai A2 


K(O.l) Exp(i) 


K(O.i) Exp(i) 


S{{)A) Exi)(l) 


100 


4.51 4.90 


4.91 5.13 


4.93 4.92 


5.01 4.91 


50 


4.63 5.58 


5.63 5.94 


5.64 5.48 


6.13 5.63 


25 


i.72 0.52 


().9i (i.Di 


0.91 G.Di 


7.12 G.9i 
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Table 12 

Critical values for the changed interval scan statistic based on NNG at 0.05 significance 

level, n = 1000. 





Critical Values 


Graph 




lo = 100 


lo = 50 


lo = 25 








Al 


A2 


Per 


Al 


A2 


Per 


Al 


A2 


Per 


D 




4.04 


4.10 


4.07 


4.15 


4.23 


4.20 


4.23 


4.31 


4.30 


2026 


2 


N(0,1) 


4.04 


4.10 


4.09 


4.15 


4.24 


4.18 


4.23 


4.31 


4.24 


1942 


2 


d = l 


4.04 


4.10 


4.11 


4.15 


4.24 


4.23 


4.23 


4.31 


4.35 


1948 


2 




4.04 


4.10 


3.96 


4.15 


4.23 


4.11 


4.23 


4.31 


4.25 


2038 


2 




4.04 


4.10 


4.04 


4.15 


4.24 


4.17 


4.23 


4.31 


4.31 


1960 


2 




4.04 


4.10 


4.00 


4.15 


4.23 


4.14 


4.23 


4.31 


4.24 


2086 


2 


Exp(l) 


4.04 


4.10 


4.08 


4.15 


4.23 


4.20 


4.23 


4.31 


4.24 


1990 


2 


d = l 


4.04 


4.10 


4.00 


4.15 


4.24 


4.15 


4.23 


4.32 


4.27 


2014 


2 




4.04 


4.10 


4.01 


4.15 


4.23 


4.20 


4.23 


4.31 


4.34 


2080 


2 




4.04 


4.10 


4.04 


4.15 


4.23 


4.18 


4.23 


4.31 


4.27 


2008 


2 




3.99 


3.92 


3.82 


4.09 


3.96 


3.88 


4.18 


3.97 


3.90 


3558 


6 


N(0,1) 


3.99 


3.91 


3.86 


4.09 


3.94 


3.86 


4.18 


3.95 


3.88 


3508 


6 


d= 10 


4.00 


3.92 


3.86 


4.10 


3.96 


3.93 


4.18 


3.97 


3.93 


3394 


6 




3.99 


3.91 


3.81 


4.09 


3.94 


3.86 


4.18 


3.95 


3.90 


3418 


6 




3.99 


3.91 


3.88 


4.09 


3.94 


3.88 


4.18 


3.95 


3.88 


3450 


6 




4.00 


3.94 


3.85 


4.10 


3.98 


3.91 


4.18 


3.99 


3.91 


3306 


6 


Exp(l) 


4.01 


3.95 


3.91 


4.11 


4.00 


3.98 


4.19 


4.02 


3.99 


3118 


5 


d= 10 


4.00 


3.94 


3.89 


4.10 


3.98 


3.93 


4.19 


4.00 


3.94 


3018 


5 




4.00 


3.95 


3.90 


4.11 


3.99 


3.93 


4.19 


4.01 


3.93 


3014 


5 




4.01 


3.96 


3.95 


4.11 


4.01 


3.97 


4.19 


4.03 


3.99 


3092 


5 




3.89 


3.55 


3.48 


4.00 


3.55 


3.48 


4.10 


3.55 


3.48 


8240 


30 


N(0,1) 


3.88 


3.50 


3.49 


3.99 


3.50 


3.49 


4.09 


3.50 


3.49 


9360 


33 


d= 100 


3.90 


3.61 


3.60 


4.00 


3.61 


3.60 


4.10 


3.61 


3.60 


8482 


18 




3.88 


3.51 


3.48 


3.99 


3.51 


3.48 


4.09 


3.51 


3.48 


9154 


40 




3.88 


3.50 


3.44 


3.99 


3.50 


3.44 


4.09 


3.50 


3.44 


9392 


39 




3.88 


3.54 


3.47 


3.99 


3.54 


3.47 


4.09 


3.54 


3.47 


10406 


45 


Exp(l) 


3.88 


3.55 


3.55 


3.99 


3.55 


3.55 


4.09 


3.55 


3.55 


10504 


44 


d= 100 


3.88 


3.54 


3.61 


3.99 


3.54 


3.61 


4.09 


3.54 


3.61 


10106 


32 




3.90 


3.64 


3.53 


4.00 


3.63 


3.53 


4.10 


3.63 


3.53 


8666 


22 




3.90 


3.58 


3.57 


4.00 


3.58 


3.57 


4.10 


3.58 


3.57 


8274 


28 
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Table 13 

Critical values for the changed interval scan statistic based on NNG at 0.01 significance 

level, n = 1000. 





Critical Values 


Graph 




lo = 100 


lo = 50 


lo = 25 








Al 


A2 


Per 


Al 


A2 


Per 


Al 


A2 


Per 


D 




4.48 


4.55 


4.58 


4.57 


4.67 


4.65 


4.64 


4.73 


4.65 


2026 


2 


N(0,1) 


4.48 


4.56 


4.53 


4.57 


4.68 


4.71 


4.64 


4.74 


4.79 


1942 


2 


d = l 


4.48 


4.56 


4.56 


4.57 


4.68 


4.72 


4.64 


4.74 


4.83 


1948 


2 




4.48 


4.55 


4.45 


4.57 


4.67 


4.68 


4.64 


4.74 


4.69 


2038 


2 




4.48 


4.56 


4.56 


4.57 


4.68 


4.66 


4.64 


4.74 


4.82 


1960 


2 




4.48 


4.55 


4.49 


4.57 


4.67 


4.62 


4.64 


4.74 


4.68 


2086 


2 


Exp(l) 


4.48 


4.55 


4.49 


4.57 


4.67 


4.57 


4.64 


4.73 


4.57 


1990 


2 


d = l 


4.48 


4.56 


4.49 


4.57 


4.68 


4.59 


4.64 


4.75 


4.60 


2014 


2 




4.48 


4.55 


4.61 


4.57 


4.67 


4.65 


4.64 


4.74 


4.76 


2080 


2 




4.48 


4.55 


4.60 


4.57 


4.67 


4.65 


4.64 


4.73 


4.78 


2008 


2 




4.44 


4.35 


4.20 


4.52 


4.39 


4.25 


4.60 


4.37 


4.25 


3558 


6 


N(0,1) 


4.44 


4.34 


4.34 


4.52 


4.35 


4.38 


4.59 


4.35 


4.38 


3508 


6 


d= 10 


4.44 


4.35 


4.28 


4.52 


4.36 


4.33 


4.60 


4.37 


4.33 


3394 


6 




4.44 


4.34 


4.30 


4.52 


4.36 


4.30 


4.59 


4.35 


4.30 


3418 


6 




4.44 


4.34 


4.22 


4.52 


4.35 


4.22 


4.59 


4.35 


4.22 


3450 


6 




4.44 


4.37 


4.31 


4.53 


4.43 


4.39 


4.60 


4.39 


4.39 


3306 


6 


Exp(l) 


4.45 


4.38 


4.39 


4.53 


4.42 


4.50 


4.60 


4.42 


4.50 


3118 


5 


d= 10 


4.45 


4.37 


4.31 


4.53 


4.72 


4.33 


4.60 


4.39 


4.38 


3018 


5 




4.45 


4.38 


4.42 


4.53 


4.35 


4.45 


4.60 


4.41 


4.45 


3014 


5 




4.45 


4.39 


4.46 


4.53 


4.43 


4.47 


4.61 


4.43 


4.47 


3092 


5 




4.35 


4.02 


3.91 


4.44 


4.02 


3.91 


4.53 


4.02 


3.91 


8240 


30 


N(0,1) 


4.34 


3.97 


3.82 


4.44 


3.97 


3.82 


4.52 


3.97 


3.82 


9360 


33 


d= 100 


4.36 


4.07 


3.94 


4.45 


4.07 


3.94 


4.53 


4.07 


3.94 


8482 


18 




4.35 


3.99 


4.06 


4.44 


3.99 


4.06 


4.52 


3.99 


4.06 


9154 


40 




4.34 


3.98 


3.83 


4.44 


3.98 


3.83 


4.52 


3.98 


3.83 


9392 


39 




4.34 


4.02 


3.87 


4.43 


4.02 


3.87 


4.52 


4.02 


3.87 


10406 


45 


Exp(l) 


4.34 


4.03 


3.99 


4.43 


4.03 


3.99 


4.52 


4.03 


3.99 


10504 


44 


d= 100 


4.34 


4.02 


4.22 


4.43 


4.02 


4.22 


4.52 


4.02 


4.22 


10106 


32 




4.35 


4.10 


3.95 


4.44 


4.10 


3.95 


4.53 


4.10 


3.95 


8666 


22 




4.36 


4.05 


4.02 


4.45 


4.05 


4.02 


4.53 


4.05 


4.02 


8274 


28 



